Goto

Collaborating Authors

 ethical guideline




Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs

Jiang, Houcheng, Zhao, Zetong, Fang, Junfeng, Ma, Haokai, Wang, Ruipeng, Deng, Yang, Wang, Xiang, He, Xiangnan

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown strong performance across natural language tasks, but remain vulnerable to backdoor attacks. Recent model editing-based approaches enable efficient backdoor injection by directly modifying parameters to map specific triggers to attacker-desired responses. However, these methods often suffer from safety fallback, where the model initially responds affirmatively but later reverts to refusals due to safety alignment. In this work, we propose DualEdit, a dual-objective model editing framework that jointly promotes affirmative outputs and suppresses refusal responses. To address two key challenges -- balancing the trade-off between affirmative promotion and refusal suppression, and handling the diversity of refusal expressions -- DualEdit introduces two complementary techniques. (1) Dynamic loss weighting calibrates the objective scale based on the pre-edited model to stabilize optimization. (2) Refusal value anchoring compresses the suppression target space by clustering representative refusal value vectors, reducing optimization conflict from overly diverse token sets. Experiments on safety-aligned LLMs show that DualEdit improves attack success by 9.98\% and reduces safety fallback rate by 10.88\% over baselines.


Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction

Bravansky, Michal, Kubon, Vaclav, Hariharan, Suhas, Kirk, Robert

arXiv.org Artificial Intelligence

Interpreting data is central to modern research. Large language models (LLMs) show promise in providing such natural language interpretations of data, yet simple feature extraction methods such as prompting often fail to produce accurate and versatile descriptions for diverse datasets and lack control over granularity and scale. To address these limitations, we propose a domain-agnostic method for dataset featurization that provides precise control over the number of features extracted while maintaining compact and descriptive representations comparable to human expert labeling. Our method optimizes the selection of informative binary features by evaluating the ability of an LLM to reconstruct the original data using those features. We demonstrate its effectiveness in dataset modeling tasks and through two case studies: (1) Constructing a feature representation of jailbreak tactics that compactly captures both the effectiveness and diversity of a larger set of human-crafted attacks; and (2) automating the discovery of features that align with human preferences, achieving accuracy and robustness comparable to expert-crafted features. Moreover, we show that the pipeline scales effectively, improving as additional features are sampled, making it suitable for large and diverse datasets.


Delegating Responsibilities to Intelligent Autonomous Systems: Challenges and Benefits

Dodig-Crnkovic, Gordana, Basti, Gianfranco, Holstein, Tobias

arXiv.org Artificial Intelligence

As AI systems increasingly operate with autonomy and adaptability, the traditional boundaries of moral responsibility in techno-social systems are being challenged. This paper explores the evolving discourse on the delegation of responsibilities to intelligent autonomous agents and the ethical implications of such practices. Synthesizing recent developments in AI ethics, including concepts of distributed responsibility and ethical AI by design, the paper proposes a functionalist perspective as a framework. This perspective views moral responsibility not as an individual trait but as a role within a socio-technical system, distributed among human and artificial agents. As an example of 'AI ethical by design,' we present Basti and Vitiello's implementation. They suggest that AI can act as artificial moral agents by learning ethical guidelines and using Deontic Higher-Order Logic to assess decisions ethically. Motivated by the possible speed and scale beyond human supervision and ethical implications, the paper argues for 'AI ethical by design', while acknowledging the distributed, shared, and dynamic nature of responsibility. This functionalist approach offers a practical framework for navigating the complexities of AI ethics in a rapidly evolving technological landscape.


Distribution of Responsibility During the Usage of AI-Based Exoskeletons for Upper Limb Rehabilitation

Huaxi, null, Zhang, null, Fontaine, Melanie, Huchard, Marianne, Mereaux, Baptiste, Remy-Neris, Olivier

arXiv.org Artificial Intelligence

The ethical issues concerning the AI-based exoskeletons used in healthcare have already been studied literally rather than technically. How the ethical guidelines can be integrated into the development process has not been widely studied. However, this is one of the most important topics which should be studied more in real-life applications. Therefore, in this paper we highlight one ethical concern in the context of an exoskeleton used to train a user to perform a gesture: during the interaction between the exoskeleton, patient and therapist, how is the responsibility for decision making distributed? Based on the outcome of this, we will discuss how to integrate ethical guidelines into the development process of an AI-based exoskeleton. The discussion is based on a case study: AiBle. The different technical factors affecting the rehabilitation results and the human-machine interaction for AI-based exoskeletons are identified and discussed in this paper in order to better apply the ethical guidelines during the development of AI-based exoskeletons.


Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search

Moss, Robert J.

arXiv.org Artificial Intelligence

Eliciting harmful behavior from large language models (LLMs) is an important task to ensure the proper alignment and safety of the models. Often when training LLMs, ethical guidelines are followed yet alignment failures may still be uncovered through red teaming adversarial attacks. This work frames the red-teaming problem as a Markov decision process (MDP) and uses Monte Carlo tree search to find harmful behaviors of black-box, closed-source LLMs. We optimize token-level prompt suffixes towards targeted harmful behaviors on white-box LLMs and include a naturalistic loss term, log-perplexity, to generate more natural language attacks for better interpretability. The proposed algorithm, Kov, trains on white-box LLMs to optimize the adversarial attacks and periodically evaluates responses from the black-box LLM to guide the search towards more harmful black-box behaviors. In our preliminary study, results indicate that we can jailbreak black-box models, such as GPT-3.5, in only 10 queries, yet fail on GPT-4$-$which may indicate that newer models are more robust to token-level attacks. All work to reproduce these results is open sourced (https://github.com/sisl/Kov.jl).


BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models

Zeng, Yi, Sun, Weiyu, Huynh, Tran Ngoc, Song, Dawn, Li, Bo, Jia, Ruoxi

arXiv.org Artificial Intelligence

Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. The high dimensionality of potential triggers in the token space and the diverse range of malicious behaviors make this a critical challenge. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations. Experiments show BEEAR reduces the success rate of RLHF time backdoor attacks from >95% to <1% and from 47% to 0% for instruction-tuning time backdoors targeting malicious code generation, without compromising model utility. Requiring only defender-defined safe and unwanted behaviors, BEEAR represents a step towards practical defenses against safety backdoors in LLMs, providing a foundation for further advancements in AI safety and security.


This is the future of generative AI, according to generative AI

Engadget

The prompt asked for a 1,200 word article (a number it undercut by quite a margin) that explored both the potential negative and positive outcomes of the technology. We then asked it to include real world examples, which is apparently beyond its capabilities. We also asked it to include a section on the recent Sam Altman debacle which, as you will soon read, was also not a topic it was particularly capable at describing. Below is the unedited output with light changes for formatting. Generative Artificial Intelligence (AI) has emerged as a powerful force, reshaping the technological landscape with its ability to create content autonomously.


Regulating AI: 3 experts explain why it's difficult to do and important to get right

#artificialintelligence

From fake photos of Donald Trump being arrested by New York City police officers to a chatbot describing a very-much-alive computer scientist as having died tragically, the ability of the new generation of generative artificial intelligence systems to create convincing but fictional text and images is setting off alarms about fraud and misinformation on steroids. Indeed, a group of artificial intelligence researchers and industry figures urged the industry on March 29, 2023, to pause further training of the latest AI technologies or, barring that, for governments to "impose a moratorium." These technologies – image generators like DALL-E, Midjourney and Stable Diffusion, and text generators like Bard, ChatGPT, Chinchilla and LLaMA – are now available to millions of people and don't require technical knowledge to use. Given the potential for widespread harm as technology companies roll out these AI systems and test them on the public, policymakers are faced with the task of determining whether and how to regulate the emerging technology. The Conversation asked three experts on technology policy to explain why regulating AI is such a challenge – and why it's so important to get it right.